Unsupervised Morphological Disambiguation using Statistical Language Models
Abstract
In this paper, we present a probabilistic model for the unsupervised morphological disambiguation problem. Our model assigns morphological parses T to the contexts C instead of assigning them to the words W. The target word w ∈ W determines the possible parse set T_w ⊂ T that can be used in w's context c_w ∈ C. To assign the correct morphological parse to w, our model finds the parse t ∈ T_w that maximizes P(t|c_w). The probabilities P(t|c_w) are estimated using a statistical language model and the vocabulary of the corpus. The system performs significantly better than an unsupervised baseline and its performance is close to a supervised baseline.
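The decision rule in the abstract, picking the parse t ∈ T_w that maximizes P(t|c_w), can be illustrated with a small Python sketch. The abstract does not say how the language model and the vocabulary are combined, so the estimator below is an assumption made only for illustration: it gives each parse the language-model probability mass of the vocabulary words that can carry that parse in the given context. The names `estimate_parse_probs`, `disambiguate`, `VOCAB_PARSES`, `TOY_LM`, and `toy_lm`, as well as the parse labels and probabilities, are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Context = Tuple[str, ...]  # e.g. the words surrounding the target word


def estimate_parse_probs(
    context: Context,
    vocab_parses: Dict[str, List[str]],
    lm_prob: Callable[[str, Context], float],
) -> Dict[str, float]:
    """Assumed estimator (not taken from the paper): P(t|c) is proportional
    to the language-model mass of vocabulary words that can take parse t."""
    scores: Dict[str, float] = defaultdict(float)
    for word, parses in vocab_parses.items():
        p = lm_prob(word, context)
        for t in parses:
            # Assumption: an ambiguous word splits its mass evenly over its parses.
            scores[t] += p / len(parses)
    total = sum(scores.values())
    if total == 0:
        return {}
    return {t: s / total for t, s in scores.items()}


def disambiguate(
    word: str,
    context: Context,
    vocab_parses: Dict[str, List[str]],
    lm_prob: Callable[[str, Context], float],
) -> str:
    """Return the parse t in T_w (the parses of `word`) that maximizes P(t|c_w)."""
    probs = estimate_parse_probs(context, vocab_parses, lm_prob)
    return max(vocab_parses[word], key=lambda t: probs.get(t, 0.0))


# Toy data, invented for illustration only.
VOCAB_PARSES = {
    "saw": ["Noun", "Verb+Past"],  # ambiguous target word
    "hammer": ["Noun"],
    "ate": ["Verb+Past"],
}

# Stand-in for a statistical language model: P(word | context) as a lookup table.
TOY_LM = {
    ("the", "was"): {"hammer": 0.6, "saw": 0.3, "ate": 0.1},
    ("she", "it"): {"ate": 0.6, "saw": 0.3, "hammer": 0.1},
}


def toy_lm(word: str, context: Context) -> float:
    return TOY_LM.get(context, {}).get(word, 0.0)


print(disambiguate("saw", ("the", "was"), VOCAB_PARSES, toy_lm))  # Noun
print(disambiguate("saw", ("she", "it"), VOCAB_PARSES, toy_lm))   # Verb+Past
```

In this sketch the parse is chosen from the context alone; the target word only restricts the candidates to T_w, which mirrors the abstract's point that parses are assigned to contexts rather than to words.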
Similar resources
Combining Hand-crafted Rules and Unsupervised Learning in Constraint-based Morphological Disambiguation
This paper presents a constraint-based morphological disambiguation approach that is applicable to languages with complex morphology, specifically agglutinative languages with productive inflectional and derivational morphological phenomena. In certain respects, our approach has been motivated by Brill's recent work (Brill, 1995b), but with the observation that his transformational approach is not ...
Knowledge-Rich Morphological Priors for Bayesian Language Models
We present a morphology-aware nonparametric Bayesian model of language whose prior distribution uses manually constructed finite-state transducers to capture the word formation processes of particular languages. This relaxes the word independence assumption and enables sharing of statistical strength across, for example, stems or inflectional paradigms in different contexts. Our model can be use...
The TÜBITAK-UEKAE statistical machine translation system for IWSLT 2009
We describe our Arabic-to-English and Turkish-to-English machine translation systems that participated in the IWSLT 2009 evaluation campaign. Both systems are based on the Moses statistical machine translation toolkit, with added components to address the rich morphology of the source languages. Three different morphological approaches are investigated for Turkish. Our primary submission uses l...
Unsupervised Resolution of Acronyms and Abbreviations in Nursing Notes Using Document-Level Context Models
Automatic simplification of clinical notes continues to be an important challenge for NLP systems. A frequent obstacle to developing more robust NLP systems for the clinical domain is the lack of annotated training data. This study investigates unsupervised techniques for one key aspect of medical text simplification, viz. the expansion and disambiguation of acronyms and abbreviations. Our appr...
An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation
Morphological disambiguation is the process of assigning one set of morphological features to each individual word in a text. When the word is ambiguous (there are several possible analyses for the word), a disambiguation procedure based on the word context must be applied. This paper deals with morphological disambiguation of the Hebrew language, which combines morphemes into a word in both ag...